  gathering infos from web pages  
From: Fa3ien
Date: 21 Nov 2007 08:50:05
Message: <4744378d@news.povray.org>
Hi,
in our country, we have an official, government-operated website
which publishes public tender opportunities.

Sample URL of a result page:
http://www.ejustice.just.fgov.be/cgi_bul/bul_a_1.pl?DETAIL=DETAIL&caller=list&row_id=1&numero=1&rech=472&numac=2007051997&pd=2007-11-16&lg=F&pdf_file=%2Fhome%2Fmon1%2Fbul%2Fimage%2F2007%2F1116_1.pdf&trier=+order+by+numac+desc%2C+pd%3B&language=fr&choix1=ET&choix2=ET&fromtab=BUL&sql=objet+contains++%27architecture%27&objet=architecture

These people don't have an RSS feed available, or anything else that would
spare us the tedious manual checking with keywords every week
(hundreds of offers are published each day).

I'd like to be able to automate the process, so I can produce some kind of
digest of the offers we are likely to be interested in. The "examine a page
and determine if we are likely to be interested" step will be easy.  I have a
problem with the first step: "automatically retrieve every page starting
from a given one".

After some observation and tests, I know how to get the "next offer" by tweaking
the URL string appropriately.  But I need to read the content of the resulting
page to determine when I need to change the date ('pd' in the query) so I can
continue incrementing the numbering ('numac' in query).  That's where it goes bad.
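
The URL tweaking described above could be sketched in Python roughly as
follows. This is only an illustration under assumptions: the helper names
(with_params, next_offer) are made up, and it assumes that adding 1 to
'numac' is enough to reach the next offer, which may not hold for every
parameter combination the site expects.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_params(url, **changes):
    """Return url with the named query parameters replaced (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Keep the original parameter order; only swap the values that were asked for.
    updated = [(key, str(changes.get(key, value))) for key, value in query]
    return urlunsplit(parts._replace(query=urlencode(updated)))


def next_offer(url):
    """Assumption: the next offer is reached by incrementing 'numac' by one."""
    params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
    return with_params(url, numac=int(params["numac"]) + 1)
```

Changing the date would then be a call such as
with_params(url, pd="2007-11-17"), leaving the other parameters untouched.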

I thought 'well, just do some JavaScript, put the content of the URL in an iframe,
read it, and act accordingly'.  Done that. It doesn't work. Why?  XMLHttpRequest,
which I used to get the content of the page into a string, is prohibited
(in any browser in existence) from working with content from another domain. Ouch!
I found a GreaseMonkey script which claimed to bypass this "cross-domain
policy", but it didn't work.

So I'm still at the start of this seemingly simple project.  I'm currently thinking
of getting the pages with wget, but can I drive wget from JavaScript? Or should
I try another language?  Or a completely different path?
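
As one possible "other language" route: the cross-domain restriction is
enforced by the browser, so a standalone script that downloads the pages
itself is not subject to it. The sketch below uses only Python's standard
library. The names fetch and collect, and the next_url and keep_going
callbacks, are hypothetical; how to recognise the page where the date has
to change is not known here, so that test is left to the caller.

```python
import urllib.request


def fetch(url, timeout=30):
    """Download one page and return its text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Assumption: fall back to latin-1 when the server declares no charset.
        charset = resp.headers.get_content_charset() or "latin-1"
        return resp.read().decode(charset, errors="replace")


def collect(start_url, next_url, keep_going, get=fetch, limit=500):
    """Fetch pages from start_url until keep_going(html) is false.

    next_url(url) returns the URL of the following offer; keep_going(html)
    inspects a page and says whether to continue. Both are supplied by the
    caller. limit guards against looping forever.
    """
    pages = []
    url = start_url
    for _ in range(limit):
        html = get(url)
        if not keep_going(html):
            break
        pages.append((url, html))
        url = next_url(url)
    return pages
```

A variation would switch to the next date instead of stopping when
keep_going reports the end of a day's offers.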

Ideas ?

TIA,
Fabien.


